vi        Contents

1.5.8 Sequence Length Distribution        30
1.5.9 Sequence Duplication Levels        31
1.5.10 Overrepresented Sequences        31
1.5.11 Adapter Content        32
1.5.12 K-mer Content        33
1.6 PREPROCESSING OF THE FASTQ READS        34
1.7 SUMMARY        45
REFERENCES        46

Chapter 2        Mapping of Sequence Reads to the Reference Genomes        49
2.1 INTRODUCTION TO SEQUENCE MAPPING        49
2.2 READ MAPPING        55
2.2.1 Trie        56
2.2.2 Suffix Tree        56
2.2.3 Suffix Arrays        57
2.2.4 Burrows–Wheeler Transform        58
2.2.5 FM-Index        62
2.3 READ SEQUENCE ALIGNMENT AND ALIGNERS        63
2.3.1 SAM and BAM File Formats        65
2.3.2 Read Aligners        70
2.3.2.1 Burrows–Wheeler Aligner        71
2.3.2.2 Bowtie2        75
2.3.2.3 STAR        76
2.4 MANIPULATING ALIGNMENTS IN SAM/BAM FILES        79
2.4.1 Samtools        79
2.4.1.1 SAM/BAM Format Conversion        79
2.4.1.2 Sorting Alignment        80
2.4.1.3 Indexing BAM File        80
2.4.1.4 Extracting Alignments of a Chromosome        81
2.4.1.5 Filtering and Counting Alignment in SAM/BAM Files        81
2.4.1.6 Removing Duplicate Reads        82
2.4.1.7 Descriptive Statistics        83
2.5 REFERENCE-GUIDED GENOME ASSEMBLY        83
2.6 SUMMARY        85
REFERENCES        86